WOODROOFE'S ONE-ARMED BANDIT PROBLEM REVISITED

By Alexander Goldenshluger and Assaf Zeevi 

University of Haifa and Columbia University 

We consider the one-armed bandit problem of Woodroofe [J. 
Amer. Statist. Assoc. 74 (1979) 799-806], which involves sequential 
sampling from two populations: one whose characteristics are known, 
and one which depends on an unknown parameter and incorporates 
a covariate. The goal is to maximize cumulative expected reward. We 
study this problem in a minimax setting, and develop rate-optimal 
policies that involve suitable modifications of the myopic rule. It is 
shown that the regret, as well as the rate of sampling from the infe- 
rior population, can be finite or grow at various rates with the time 
horizon of the problem, depending on "local" properties of the covari- 
ate distribution. Proofs rely on martingale methods and information 
theoretic arguments. 

1. Introduction. 

Background and motivation. In his landmark paper, Robbins (1952) in- 
troduced an important class of sequential allocation problems, known collec- 
tively as multi-armed bandit problems. These models have since played cen- 
tral roles in areas such as statistics, operations research, engineering, com- 
puter science and economics. Berry and Fristedt (1985) and Gittins (1989) 
are standard references on the subject, and Lai (2001) and 
Cesa-Bianchi and Lugosi (2006) provide recent overviews of this voluminous 
literature. 

The basic two-armed bandit problem can be described as follows. Consider 
two statistical populations. At each point in time, a single observation from 
one of the two populations can be taken, and a random "reward," governed 
by the properties of the sampled population, is realized. The objective is 
to devise a sampling policy that maximizes the expected cumulative (or 
discounted) reward over a designated time horizon. A particular case of 
interest arises when the probability distribution of one population is known 
a priori. This setting is often referred to as a one-armed bandit problem. 
Motivation for bandit problems can be found, among other examples, in 
clinical trials. Here the objective is to test two (or more) treatments and 
eventually allocate the one with greater efficacy to incoming patients; see 
Lai (2001) for further discussion and references therein. 

The bulk of literature on multi-armed bandits assumes that the sampled 
populations are homogeneous. However, in many practical situations some 
additional information in the form of a covariate can be utilized for alloca- 
tion purposes, and the reward distributions may depend on this covariate. 
For example, in the clinical trials context before deciding to assign a given 
patient to a treatment we can observe a covariate such as age or severity of 
disease. The one-armed bandit problem with covariates was first addressed 
in the pioneering work of Woodroofe (1979), who introduced and studied 
the following model. 

Woodroofe's one-armed bandit problem. Let $\{(X_t, Y_t^{(0)}, Y_t^{(1)}), t \ge 1\}$ denote 
a sequence of random vectors, where $X_t$ is a covariate value at stage $t$, and 
$Y_t^{(i)}$ is a potential reward from the arm $i = 0, 1$ that can be obtained at stage 
$t$. It is assumed that: 

(i) the conditional distribution of $Y^{(0)}$ given $X$ is known, while the conditional 
distribution of $Y^{(1)}$ given $X$ depends on an unknown parameter $\theta$; 

(ii) for any given value of the parameter $\theta$, $(X_t, Y_t^{(0)}, Y_t^{(1)})$ are independent 
and identically distributed (i.i.d.) copies of $(X, Y^{(0)}, Y^{(1)})$. 

Suppose that $X_1, \ldots, X_t, \ldots$ are observed sequentially over time, and at 
each stage $t$ we can observe either $Y_t^{(0)}$ or $Y_t^{(1)}$, but not both. Let $\pi_t = 0$ or 
$1$ according to whether $Y_t^{(0)}$ or $Y_t^{(1)}$ is observed at stage $t$; we will refer to 
this as sampling of arm 0 or arm 1, respectively. Then the objective is to 
develop a sampling policy $\pi = (\pi_t, t \ge 1)$ such that the expected value of the 
total geometrically discounted reward 

$V_\pi = \sum_{t=1}^{\infty} p^{t-1} \big[\pi_t Y_t^{(1)} + (1 - \pi_t) Y_t^{(0)}\big]$

is maximized. Here $p \in (0, 1)$ is a discount factor. By policy we mean a 
sequence $\pi = (\pi_t, t \ge 1)$ of random variables taking values in $\{0, 1\}$ such 
that $\pi_t$ depends on the observations collected up until time $t$. 
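
As a small illustration of this objective, the discounted reward of a finite sample path can be evaluated directly. The sketch below is not part of Woodroofe's formulation: the function name `discounted_reward` and the truncation of the infinite sum to the stages supplied are assumptions made for the example.

```python
from typing import Sequence


def discounted_reward(pi: Sequence[int], y0: Sequence[float],
                      y1: Sequence[float], p: float) -> float:
    """Truncated discounted reward: sum_t p^(t-1) * [pi_t*y1_t + (1-pi_t)*y0_t].

    pi[t] is the arm (0 or 1) chosen at stage t+1, and y0/y1 hold the
    potential rewards of arm 0 and arm 1. The sum stops at len(pi).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("discount factor must lie in (0, 1)")
    total = 0.0
    for t, (a, r0, r1) in enumerate(zip(pi, y0, y1)):  # t = 0 is stage 1
        total += p ** t * (a * r1 + (1 - a) * r0)
    return total
```

For instance, with decisions `[1, 0, 1]`, arm-0 rewards all zero, arm-1 rewards `[2, 5, 4]` and discount factor `0.5`, the stages contribute 2, 0 and 0.25 * 4 respectively.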

Woodroofe (1979) considered the outlined problem in the Bayesian setting 
under the assumption that $Y^{(1)} = X - \theta + \varepsilon$, where $\varepsilon$ is a zero mean 
random variable with known distribution, independent of $X$. For a given 
prior distribution of $\theta$, Woodroofe (1979) provided a (nonconstructive) 
description of the optimal Bayesian policy. It was also shown that the myopic 
policy, which selects arm 1 when the value of $X$ is greater than the current 
estimate (posterior mean) of $\theta$, is asymptotically optimal as $p$ tends to 1. 
These results were later extended by Sarkar (1991) to a slightly more general 
setting. 

Summary of results. In this paper we revisit Woodroofe's one-armed 
bandit problem with covariates, but with several notable distinctions. 

We consider a minimax (non-Bayesian) framework with finite horizon 
n; see Section 2 for details. This is more in line with the formulation in 
Robbins (1952), and the seminal work of Lai and Robbins (1985) that de- 
veloped asymptotically-optimal policies for traditional multi-armed bandits. 
The performance of a policy $\pi$ is measured relative to the oracle policy 
$\pi^* = (\pi_t^*, t \ge 1)$ that "knows" the unknown parameter $\theta$ and at each step, 
given the covariate value $X$, selects the arm with highest expected reward. In 
this context the regret and the inferior sampling rate are natural policy 
performance measures. The regret refers to the loss in the expected cumulative 
reward that stems from the use of a given policy relative to the oracle policy. 
The inferior sampling rate is the expected number of wrong arm selections 
prescribed by the policy. Assuming that the distribution of $(X, Y^{(1)})$ belongs 
to some natural class $\mathcal{P}$ of joint distributions, we measure performance of a 
policy $\pi$ by the maximal regret and inferior sampling rate over the class $\mathcal{P}$. 

In this work we study minimax complexity of the one-armed bandit prob- 
lem with covariates. We establish explicit nonasymptotic lower bounds on 
the minimax regret and inferior sampling rate (see Section 3.3, Theorems 3 
and 4) and develop simple and intuitive rate-optimal policies which achieve 
these bounds in the sense of the order (see Section 3, Theorems 1 and 2). 

Our work highlights a key property of the bandit problem with covariates: 
the performance of any policy depends critically on the behavior of the 
covariate distribution in the vicinity of the "decision boundary" $x = \theta$ (see 
Definition 1). This is akin to the Tsybakov margin condition that plays a pivotal 
role in nonparametric classification problems [see Mammen and Tsybakov 
(1999) and Tsybakov (2004a)]. In particular, depending on this margin 
condition, there are three distinct "regimes": one where it is possible to achieve 
a finite regret as $n \to \infty$; one where the regret grows like $\ln n$; and one where 
the regret grows like a fractional power of $n$ (see Remarks 3 and 4). These 
cases correspond to natural classes of distributions. 

It is worth pointing out that the rate-optimal policies developed in this 
paper are not myopic. Indeed, we were not able to prove that the myopic 
policy is rate optimal in our setting. This issue is discussed in Section 3.4, 
where our results are compared with those of Woodroofe (1979) and Sarkar 
(1991). The numerical results in Section 4 lend further credence to this point 
by illustrating the inferior performance of the myopic policy relative to the 
two policies proposed in this paper. See also further discussion in Section 4. 

Further related literature. In contrast to the voluminous literature on 
traditional multi-armed bandits, the number of papers that address bandit 
problems with covariates is rather limited. We refer to Woodroofe (1982), 
Clayton (1989), Yang and Zhu (2002), Wang, Kulkarni and Poor (2005) and 
Goldenshluger and Zeevi (2008), where further references can be found. 

The rest of the paper is organized as follows. Section 2 contains the prob- 
lem formulation and the main definitions. In Section 3 we introduce two poli- 
cies, establish upper bounds on their performance and derive lower bounds 
on the minimax regret and inferior sampling rate. Section 4 presents numer- 
ical results, and proofs of all results are given in Section 5. 

2. Problem formulation. 

The model. Assume that a sequence of i.i.d. random variables $X_1, X_2, \ldots$ 
with common distribution $P_X$ is observed sequentially over time. At each 
stage $t$, one can allocate the covariate $X_t$ to one of two response models 
obtaining random rewards $Y_t^{(0)}$ and $Y_t^{(1)}$, respectively. The random vectors 
$(X_t, Y_t^{(0)}, Y_t^{(1)})$ are i.i.d. copies of $(X, Y^{(0)}, Y^{(1)})$, and the conditional 
distribution of $Y^{(0)}$ given $X$ is known. Allocation of $X_t$ to the $i$th arm ($i = 0, 1$) 
gives rise to a response (reward) $Y_t^{(i)}$ as follows: 

(2.1)  $Y_t^{(0)} = 0, \qquad Y_t^{(1)} = X_t - \theta + \varepsilon_t, \qquad t = 1, \ldots, n,$

where $\theta$ is an unknown parameter, and $\{\varepsilon_t\}$ is a sequence of i.i.d. $\mathcal{N}(0, \sigma^2)$ 
random variables, independent of the sequence $\{X_t, t \ge 1\}$. As in Woodroofe 
(1979), the assumption $Y^{(0)} = 0$ does not restrict generality. Since the regret 
depends linearly on the observed rewards [see (2.2)], the reduction to $Y^{(0)} = 0$ 
is achieved by considering $Y = Y^{(1)} - E(Y^{(0)} \mid X)$ instead of $Y^{(1)}$; thus, we 
always write $Y$ instead of $Y^{(1)}$. Here and in what follows all random variables 
are assumed to be defined on the common probability space $(\Omega, \mathcal{F}, P)$, and 
$E$ stands for the expectation with respect to $P$. 

Policies and performance measures. By a policy $\pi = (\pi_t, t \ge 1)$ we mean 
any sequence of random variables taking values in $\{0, 1\}$ such that $\pi_t$ is 
$\mathcal{F}_{t-1}^{+}$-measurable; here $\mathcal{F}_{t-1}^{+}$ is the $\sigma$-field generated by the data collected up 
until time $t - 1$, and by the current value of the covariate $X_t$, that is, 

$\mathcal{F}_t^{+} := \sigma(X_1, \ldots, X_t, X_{t+1}; \pi_1, \ldots, \pi_t; \pi_1 Y_1, \ldots, \pi_t Y_t).$

We also denote $\mathcal{F}_t := \sigma(X_1, \ldots, X_t; \pi_1, \ldots, \pi_t; \pi_1 Y_1, \ldots, \pi_t Y_t)$. 
The quality of a policy $\pi = (\pi_t, t \ge 1)$ is measured relative to the 
performance of the oracle $\pi^* = (\pi_t^*, t \ge 1)$ which "knows" the value of the 
parameter $\theta$ a priori, and at each stage $t$ prescribes 

$\pi_t^* := \pi_t^*(X_t) = I\{X_t \ge \theta\}, \qquad t = 1, 2, \ldots.$

The regret $R_n(\pi, \pi^*)$ is defined as the difference between the expected 
total rewards accumulated by the oracle $\pi^*$, and the expected total reward 
generated by $\pi$ over a horizon $n$: 

(2.2)  $R_n(\pi, \pi^*) := E\Big[\sum_{t=1}^{n} \pi_t^* Y_t - \sum_{t=1}^{n} \pi_t Y_t\Big] = E \sum_{t=1}^{n} |X_t - \theta| \, I\{\pi_t \ne \pi_t^*\}.$

Another performance characteristic of a policy $\pi$ is the inferior sampling 
rate defined by 

$S_n(\pi, \pi^*) := E[T_{\inf}(n)] = \sum_{t=1}^{n} P\{\pi_t \ne \pi_t^*\},$

where $T_{\inf}(n) = \sum_{t=1}^{n} I\{\pi_t \ne \pi_t^*\}$ is the total number of times the policy $\pi$ 
sampled the inferior arm. 
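
To make the two performance measures concrete, the sketch below evaluates, for one realized sample path, the quantities whose expectations define the regret in (2.2) and the inferior sampling rate. The function names and the plain-list data layout are assumptions made for this illustration; averaging the outputs over many independent paths would approximate $R_n$ and $S_n$.

```python
from typing import List, Sequence, Tuple


def oracle_actions(x: Sequence[float], theta: float) -> List[int]:
    """Oracle rule: pull arm 1 exactly when the covariate is at least theta."""
    return [1 if xt >= theta else 0 for xt in x]


def path_regret_and_inferior_count(x: Sequence[float], pi: Sequence[int],
                                   theta: float) -> Tuple[float, int]:
    """Return (sum_t |x_t - theta| * 1{pi_t != oracle_t}, number of wrong pulls)."""
    regret, wrong = 0.0, 0
    for xt, a, a_star in zip(x, pi, oracle_actions(x, theta)):
        if a != a_star:
            wrong += 1
            regret += abs(xt - theta)  # expected reward lost by this wrong pull
    return regret, wrong
```

A wrong pull made at a covariate value close to $\theta$ adds one unit to the inferior count but very little to the regret, which is why the two measures can behave differently in this problem.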

In this paper we adopt a minimax approach. Let $\mathcal{P}$ be a class of joint 
distributions $P_{X,Y}$ of $(X, Y)$; then the quality of a policy $\pi$ will be measured 
by the maximal regret $R_n(\pi; \mathcal{P}) = \sup_{P_{X,Y} \in \mathcal{P}} R_n(\pi, \pi^*)$, and by the maximal 
inferior sampling rate $S_n(\pi; \mathcal{P}) = \sup_{P_{X,Y} \in \mathcal{P}} S_n(\pi, \pi^*)$. The minimax regret 
and the minimax inferior sampling rate are defined by 

$R_n^*(\mathcal{P}) := \inf_{\pi} R_n(\pi; \mathcal{P}), \qquad S_n^*(\mathcal{P}) := \inf_{\pi} S_n(\pi; \mathcal{P}),$

where inf is taken over all possible policies. A policy $\pi$ is said to be rate 
optimal with respect to the class $\mathcal{P}$ if 

$\limsup_{n \to \infty} \frac{R_n(\pi; \mathcal{P})}{R_n^*(\mathcal{P})} < \infty \quad \text{and} \quad \limsup_{n \to \infty} \frac{S_n(\pi; \mathcal{P})}{S_n^*(\mathcal{P})} < \infty.$

In this paper we develop rate-optimal policies and study the behavior of 
the minimax inferior sampling rate and regret for some natural classes $\mathcal{P}$ of 
joint distributions $P_{X,Y}$. 

Classes of joint distributions $\mathcal{P}$. It turns out that the complexity of the 
one-armed bandit problem with covariates (as measured by the minimax 
inferior sampling rate and the minimax regret) is essentially determined by 
the behavior of $P_X$ near the "decision boundary" $x = \theta$. This behavior can be 
quantified by a condition similar to the so-called Tsybakov margin condition 
in classification [cf. Mammen and Tsybakov (1999), Tsybakov (2004a)]. 

The joint distribution $P_{X,Y}$ can be described in terms of the conditional 
distribution $P_{Y|X}$ and the marginal distribution $P_X$. According to (2.1), the 
conditional distribution of $Y$ given $X$ is Gaussian with mean $X - \theta$ and 
variance $\sigma^2$, that is, $\mathcal{N}(x - \theta, \sigma^2)$. 
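Definition 1. We say that $P_{X,Y} \in \mathcal{P}_\alpha(\theta)$ if $P_{Y|X=x} = \mathcal{N}(x - \theta, \sigma^2)$, and 
there exist constants $C_* > 0$, $\alpha > 0$, $x_0 > 0$, $p \in (0, 1)$ such that 

(2.3)  $P_X\{[\theta - x, \theta + x]\} \le C_* x^{\alpha} \qquad \forall x \in (0, x_0]$

and 

(2.4)  $0 < p \le P_X\{[\theta, \infty)\} < 1.$

Let $\Theta \subseteq \mathbb{R}$ and set 

$\mathcal{P}_\alpha(\Theta) = \bigcup_{\theta \in \Theta} \mathcal{P}_\alpha(\theta).$

We write $\mathcal{P}_\alpha = \mathcal{P}_\alpha((-\infty, \infty))$. 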

Several remarks on the above definition are in order. 
Remark 1. 

1. In what follows we omit in the notation explicit dependence of $\mathcal{P}_\alpha(\Theta)$ on 
parameters $C_*$, $x_0$ and $p$. Without loss of generality, we suppose that the 
constants $C_*$, $x_0$ and $p$ are all the same for all classes $\mathcal{P}_\alpha(\Theta)$. Note also 
that the parameters $C_*$, $x_0$, $\alpha$ and $p$ are related to each other; in what 
follows we assume that 

(2.5)  $p_1 := p - C_* x_0^{\alpha} > 0.$

2. Condition (2.3) describes the behavior of $P_X$ near the "decision 
boundary" $x = \theta$. The most important and typical case is that of $\alpha = 1$: if $X$ 
has a density $f_X$ w.r.t. the Lebesgue measure separated away from zero 
in the vicinity of $x = \theta$, then $P_{X,Y} \in \mathcal{P}_1(\theta)$. If $f_X(x) \asymp |x - \theta|^{\alpha - 1}$, $\alpha > 0$, for 
$x$ close to $\theta$, then $P_{X,Y} \in \mathcal{P}_\alpha(\theta)$. The case of $\alpha = \infty$ corresponds to a 
distribution of $X$ that assigns zero probability to the $x_0$-vicinity of $x = \theta$. 

3. Condition (2.4) ensures that the oracle policy samples both from arm 0 
and arm 1 with positive probability. In the absence of this condition, the 
problem is reduced to the setting with no covariates. 
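
The margin condition (2.3) can be checked by hand for the two covariate distributions used later in the numerical section. The sketch below does this numerically; the helper names are hypothetical, and the distributions (uniform on $[-1, 1]$ and a symmetric two-point law on $\pm 1$) are taken with $\theta = 0$ purely for illustration.

```python
def uniform_mass(theta: float, x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """P_X{[theta - x, theta + x]} when X is uniform on [lo, hi]."""
    left, right = max(theta - x, lo), min(theta + x, hi)
    return max(right - left, 0.0) / (hi - lo)


def two_point_mass(theta: float, x: float, atoms=(-1.0, 1.0)) -> float:
    """P_X{[theta - x, theta + x]} when X puts equal mass on each atom."""
    hits = sum(1 for a in atoms if theta - x <= a <= theta + x)
    return hits / len(atoms)
```

For the uniform law the mass of $[\theta - x, \theta + x]$ equals $x$ for $x \le 1$, so (2.3) holds with exponent $\alpha = 1$ and constant 1. For the two-point law the mass is zero for every $x < 1$, which corresponds to the case $\alpha = \infty$.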

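For the purpose of having a well-defined regret, we also consider the 
following restriction $\mathcal{P}'_\alpha(\Theta)$ of the class $\mathcal{P}_\alpha(\Theta)$. 

Definition 2. Assume that $\int |x| P_X(dx) < \infty$. For $\mu > 0$, let 

$\mathcal{P}'_\alpha(\theta) := \mathcal{P}_\alpha(\theta) \cap \Big\{P_{X,Y} : \int |x - \theta| P_X(dx) \le \mu\Big\}$

and $\mathcal{P}'_\alpha(\Theta) = \bigcup_{\theta \in \Theta} \mathcal{P}'_\alpha(\theta)$. We write $\mathcal{P}'_\alpha = \mathcal{P}'_\alpha((-\infty, \infty))$. 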
3. Main results. First we introduce some notation. For a policy $\pi = (\pi_t, t \ge 1)$, 
and $t = 1, 2, \ldots$, let $T_\pi(t) = \sum_{s=1}^{t} \pi_s$ denote the total number of 
times up until time $t$ that $\pi$ sampled from the arm 1. The estimator of $\theta$ 
based on the observations collected up until stage $t$ under the policy $\pi$ is 
defined by 

(3.1)  $\hat\theta_\pi(t) = \frac{1}{T_\pi(t)} \sum_{s=1}^{t} (X_s - Y_s) \pi_s.$

3.1. Nearly-myopic policy. Consider the following policy $\hat\pi$. Set $\hat\pi_1 = 1$, 
and for $t = 1, 2, \ldots$, define 

(3.2)  $\hat\pi_{t+1} := I\Big\{X_{t+1} \ge \hat\theta_{\hat\pi}(t) - \frac{\delta_t}{\sqrt{T_{\hat\pi}(t)}}\Big\},$

where $\delta = (\delta_t, t \ge 1)$ is a sequence of positive real numbers to be specified. 
If $\delta_t = 0$ for all $t$ in (3.2), then the corresponding policy is myopic, as it 
mimics the oracle policy $\pi^*$ by plugging in the current estimate of $\theta$. 
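
The rule (3.1)-(3.2) can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions, not the authors' code: it takes the covariates and the noise as given lists, uses the choice $\delta_t = 2\sigma\sqrt{3 \ln t}$ from Theorem 1, and exposes a hypothetical flag `myopic` that sets $\delta_t = 0$.

```python
import math
from typing import List, Sequence


def nearly_myopic(x: Sequence[float], eps: Sequence[float], theta: float,
                  sigma: float, myopic: bool = False) -> List[int]:
    """Return the 0/1 decisions of rule (3.2) on covariates x with noise eps.

    x[t] plays the role of X_{t+1}; rewards of arm 1 follow model (2.1).
    """
    pi = [1]                                      # first stage: always pull arm 1
    pulls = 1                                     # number of arm-1 pulls so far
    resid_sum = x[0] - (x[0] - theta + eps[0])    # running sum of (X_s - Y_s) pi_s
    for t in range(1, len(x)):                    # decide stage t+1 from stages 1..t
        theta_hat = resid_sum / pulls             # estimator (3.1)
        delta = 0.0 if myopic else 2.0 * sigma * math.sqrt(3.0 * math.log(t))
        a = 1 if x[t] >= theta_hat - delta / math.sqrt(pulls) else 0
        pi.append(a)
        if a == 1:                                # reward is observed only for arm 1
            y = x[t] - theta + eps[t]
            resid_sum += x[t] - y
            pulls += 1
    return pi
```

The downward shift of the threshold by $\delta_t / \sqrt{T(t)}$ makes the rule pull arm 1 more often than the myopic rule does while the estimate is still imprecise, and the shift shrinks as more observations from arm 1 accumulate.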

The next theorem establishes nonasymptotic upper bounds on the maximal 
inferior sampling rate and the maximal regret of the policy $\hat\pi$. 

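Theorem 1. Let $\hat\pi = (\hat\pi_t, t \ge 1)$ denote the policy given by (3.1)-(3.2) 
and associated with $\delta_t = 2\sigma\sqrt{3 \ln t}$. Let $t_0 := \min\{t \in \{1, \ldots, n\} : x_0 \sqrt{pt} \ge 8\sigma\sqrt{3 \ln t}\}$ 
and define 

(3.3)  $t_\alpha := \min\big\{t \in \{1, \ldots, n\} : t \ge \big(8\sqrt{3}\,\sigma\sqrt{\ln t / p}\big)^{4\alpha/(\alpha - 2)}\big\} \qquad \forall \alpha > 2.$

(i) For all $\alpha > 0$, 

(3.4)  $S_n(\hat\pi; \mathcal{P}_\alpha) \le (t_0 \vee 2) + C_* \sum_{t = t_0 \vee 2}^{n-1} \Big(8\sqrt{3}\,\sigma \sqrt{\frac{\ln t}{pt}}\Big)^{\alpha} + K.$

Furthermore, if $\alpha > 2$, then 

(3.5)  $S_n(\hat\pi; \mathcal{P}_\alpha) \le (t_0 \vee t_\alpha \vee 2) + \frac{4 C_*}{\alpha - 2} + K.$

(ii) For all $\alpha > 0$, 

(3.6)  $R_n(\hat\pi; \mathcal{P}'_\alpha) \le \mu[(t_0 \vee 2) + K] + C_* \sum_{t = t_0 \vee 2}^{n-1} \Big(8\sqrt{3}\,\sigma \sqrt{\frac{\ln t}{pt}}\Big)^{\alpha + 1}.$

In addition, if $\alpha > 1$, then 

(3.7)  $R_n(\hat\pi; \mathcal{P}'_\alpha) \le \mu[(t_0 \vee t_{\alpha+1} \vee 2) + K] + \frac{4 C_*}{\alpha - 1}.$

The constant $K = K(p)$ appearing in (3.4)-(3.7) depends on $p$ only; its 
exact expression is given in the proof of the theorem. 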



Remark 2. The bounds (3.4) and (3.6) are too conservative for large $\alpha$. 
In particular, they are not applicable to the case $\alpha = \infty$. On the other hand, 
(3.5) and (3.7) provide upper bounds on the inferior sampling rate and the 
regret in the important case of $\alpha = \infty$. In particular, $t_\infty \le c[(\sigma^2 p^{-1}) \ln(\sigma p^{-1/2})]^2$ 
for some constant $c$. 



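Remark 3. The immediate consequence of Theorem 1 is that 

(3.8)  $S_n(\hat\pi; \mathcal{P}_\alpha) \le \begin{cases} C, & \alpha > 2, \\ C (\ln n)^2, & \alpha = 2, \\ C n^{1 - \alpha/2} (\ln n)^{\alpha/2}, & 0 < \alpha < 2, \end{cases}$

(3.9)  $R_n(\hat\pi; \mathcal{P}'_\alpha) \le \begin{cases} C, & \alpha > 1, \\ C (\ln n)^2, & \alpha = 1, \\ C n^{(1 - \alpha)/2} (\ln n)^{(1 + \alpha)/2}, & 0 < \alpha < 1, \end{cases}$

where $C$ depends on parameters of the class $\mathcal{P}_\alpha$ (resp., $\mathcal{P}'_\alpha$). Thus, the 
maximal inferior sampling rate of $\hat\pi$ is finite when $\alpha > 2$. Similarly, the maximal 
regret is finite when $\alpha > 1$. On the other hand, the bounds on both the maximal 
inferior sampling rate and the maximal regret diverge to infinity when $\alpha \le 2$, and 
$\alpha \le 1$, respectively. 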
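
The rates in (3.8) can be traced to the sum in (3.4), which, up to the constant factor $(8\sqrt{3}\sigma/\sqrt{p})^{\alpha}$, equals $\sum_t (\ln t / t)^{\alpha/2}$. The following worked bound is offered as a sketch under that reading of (3.4), not as the authors' own argument. For $0 < \alpha < 2$, 

\[
\sum_{t=2}^{n-1} \Big(\frac{\ln t}{t}\Big)^{\alpha/2}
\le (\ln n)^{\alpha/2} \sum_{t=2}^{n-1} t^{-\alpha/2}
\le (\ln n)^{\alpha/2} \int_{1}^{n} t^{-\alpha/2} \, dt
\le \frac{n^{1 - \alpha/2} (\ln n)^{\alpha/2}}{1 - \alpha/2},
\]

while for $\alpha = 2$, 

\[
\sum_{t=2}^{n-1} \frac{\ln t}{t} \le \ln n \int_{1}^{n} \frac{dt}{t} = (\ln n)^2.
\]

Replacing $\alpha$ by $\alpha + 1$ in the same computation gives the exponents $n^{(1-\alpha)/2} (\ln n)^{(1+\alpha)/2}$ and $(\ln n)^2$ that appear in (3.9). 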



A natural question then is whether there exists a policy with slower growth 
rates for the inferior sampling rate (3.8) when $\alpha \le 2$, and for the regret (3.9) 
when $\alpha \le 1$. 



3.2. Forced sampling policy. Let $q > 0$ be a design parameter to be specified. 
Define the sequence $\mathcal{T}_0 = (\tau_t : t \ge 1)$ of positive integers by $\tau_1 = 1$, 
$\tau_t = \lfloor \exp\{qt\} \rfloor$, $t \ge 2$. The number of elements $N_0(t)$ of the sequence $\mathcal{T}_0$ that 
are less than or equal to $t$ satisfies the following inequalities: 

$\frac{1}{q} \ln t - 1 \le N_0(t) \le \frac{1}{q} \ln(t + 1).$
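
The counting inequalities can be checked numerically for a given design parameter. The sketch below counts the terms of the sequence $\tau_1 = 1$, $\tau_t = \lfloor \exp\{qt\} \rfloor$ directly; the function name `n0` and the choice $q = 1/12$ in the usage note are assumptions made for the illustration.

```python
import math


def n0(q: float, t: int) -> int:
    """Number of terms of tau_1 = 1, tau_s = floor(exp(q*s)), s >= 2, that are <= t.

    Repeated values are counted with multiplicity, and t is assumed to be >= 1.
    """
    count, s = 1, 2                     # tau_1 = 1 is always <= t
    while math.floor(math.exp(q * s)) <= t:
        count += 1
        s += 1
    return count
```

With $q = 1/12$, for example, `n0(1/12, 1)` returns 8, since $\lfloor \exp\{s/12\} \rfloor = 1$ for $s = 2, \ldots, 8$, and this value lies between $12 \ln 1 - 1 = -1$ and $12 \ln 2 \approx 8.3$. 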

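Consider the subsequence $\mathcal{T}$ of $\mathcal{T}_0$ containing all nonequal elements of $\mathcal{T}_0$. It 
is easily seen that if 

(3.10)  $t \ge \nu := 1 + \frac{1}{q} \ln_+\Big(\frac{1}{e^{q} - 1}\Big),$

then $\tau_t - \tau_{t-1} \ge 1$; here $\ln_+(\cdot) = \max\{\ln(\cdot), 0\}$. Therefore, if $\tau_{t'} \in \mathcal{T}_0$ and $t' \ge \nu$, 
then also $\tau_{t'} \in \mathcal{T}$. For all $t$, let $N(t) := \sum_{\tau \in \mathcal{T}} I\{\tau \le t\}$. Then $N(t) \le N_0(t)$ 
for all $t$, and 

(3.11)  $N(t) \le N_0(t) \le \frac{1}{q} \ln(t + 1) \qquad \forall t,$

(3.12)  $N(t) \ge N_0(t) - N_0(\nu) \ge \frac{1}{q} \ln\Big(\frac{t}{\nu + 1}\Big) - 1 \qquad \forall t \ge \nu.$

Now we define the policy $\tilde\pi = (\tilde\pi_t, t \ge 1)$ in the following way: set 

(3.13)  $\tilde\pi_t = \begin{cases} 1, & t \in \mathcal{T}, \\ I\{X_t \ge \hat\theta_{\tilde\pi}(t - 1)\}, & \text{otherwise}, \end{cases}$

where $\hat\theta_\pi(t)$ is given in (3.1). Thus, policy $\tilde\pi$ incorporates forced sampling 
from arm 1 at time instants $\mathcal{T}$, and myopic action in between. Under the 
policy $\tilde\pi$, arm 1 is pulled at least $N(t)$ times up until time $t$. 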
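
A minimal sketch of the forced sampling rule follows. It is an illustration under stated assumptions rather than the authors' implementation: covariates and noise are passed in as lists, rewards follow model (2.1), and the function name and return format are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def forced_sampling_policy(x: Sequence[float], eps: Sequence[float],
                           theta: float, q: float) -> Tuple[List[int], List[int]]:
    """Return (decisions, forced stages) for the rule: pull arm 1 at the distinct
    values of tau_1 = 1, tau_s = floor(exp(q*s)); act myopically otherwise."""
    n = len(x)
    forced = {1}
    s = 2
    while math.floor(math.exp(q * s)) <= n:
        forced.add(math.floor(math.exp(q * s)))   # a set keeps distinct values only
        s += 1
    pi, pulls, resid_sum = [], 0, 0.0
    for t in range(1, n + 1):                     # stage t uses x[t - 1]
        if t in forced:
            a = 1
        else:
            a = 1 if x[t - 1] >= resid_sum / pulls else 0   # myopic step
        pi.append(a)
        if a == 1:                                # reward is observed only for arm 1
            y = x[t - 1] - theta + eps[t - 1]
            resid_sum += x[t - 1] - y
            pulls += 1
    return pi, sorted(forced)
```

Because stage 1 is always forced, the estimate used in the myopic steps is well defined from stage 2 onward. The forced stages thin out exponentially, so their number up to the horizon grows only logarithmically. 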
Let $\nu$ be given in (3.10), and let 

(3.14)  $\nu_0 := \max\{\nu, \min(t : t \ge 2q^{-1} \ln(t + 1))\}.$

Theorem 2. Let $\tilde\pi = (\tilde\pi_t, t \ge 1)$ be the policy defined in (3.13) and 
associated with parameter $q > 0$. Then for all $n \ge \nu_0$ and any class $\mathcal{P}_\alpha$/$\mathcal{P}'_\alpha$ 
satisfying $x_0^2 \ge 12 q \sigma^2$, one has the following: 

(i) For any $\alpha > 0$, 
(3.15) 5„(7f;P„)<ax„^ W— +-ln(n+l) + ^^ + Ki, 



t=i 



Pit I q x^pi 



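where $c_\alpha := [(\alpha/2)^{\alpha/2} (1 - 2^{-\alpha})^{-1} + \Gamma(\alpha/2) (2 \ln 2)^{-1}]$, and $K_1$ is a constant 
depending on parameters of the class $\mathcal{P}_\alpha$, $\sigma^2$ and $q$, but independent of $\alpha$ 
and $C_*$. Furthermore, if $\alpha > 2$, then 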



1 r 1 /32rT2\ 

(3.16) 5„(vf;P«) < - ln(n + 1) + (^) 



where $K_2$ is an absolute constant. 
(ii) For any $\alpha > 0$, 




2 



(3.17) 

ri , ^, 16(a + l)a^ 1 
+ - In n + 1 + ^ — — + Ki }. 
I q xipi J 
Furthermore, if $\alpha > 1$, then 
(3.18) 



i2n(^;n)<-ln(n+l) 



The exact expressions for constants $K_1$ and $K_2$ are given in the proof of the 
theorem. 

Remark 4. 

1. Note that the statement of the theorem holds only for classes $\mathcal{P}_\alpha$ (resp., 
$\mathcal{P}'_\alpha$) such that $x_0^2 \ge 12 q \sigma^2$. This is in contrast to the policy $\hat\pi$ for which 
the results of Theorem 1 hold for classes $\mathcal{P}_\alpha$ and $\mathcal{P}'_\alpha$ with arbitrary 
parameters. Thus, the smaller the design parameter $q$, the larger the class of 
joint distributions for which the theorem statement is valid. Note, however, 
that the regret and the inferior sampling rate grow as $q$ decreases. 

2. An immediate consequence of the theorem is that 

$S_n(\tilde\pi; \mathcal{P}_\alpha) \le \begin{cases} C \ln n, & \alpha \ge 2, \\ C n^{1 - \alpha/2}, & 0 < \alpha < 2, \end{cases} \qquad R_n(\tilde\pi; \mathcal{P}'_\alpha) \le \begin{cases} C \ln n, & \alpha \ge 1, \\ C n^{(1 - \alpha)/2}, & 0 < \alpha < 1. \end{cases}$

3. Comparing the above bounds with (3.8) and (3.9), we conclude that the 
forced sampling policy $\tilde\pi$ is better than the nearly-myopic policy $\hat\pi$ in the 
zone of "small" $\alpha$ ($\alpha \le 2$ for the inferior sampling rate, $\alpha \le 1$ for the 
regret). However, the inferior sampling rate and the regret of $\tilde\pi$ grow at 
least logarithmically for all $\alpha$, so $\hat\pi$ is better in the zone of "large" $\alpha$. 
We were not able to develop a single policy that simultaneously shares 
properties of $\hat\pi$ for large $\alpha$ and $\tilde\pi$ for small $\alpha$. 

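3.3. Lower bounds. Theorem 1 shows that $S_n(\hat\pi; \mathcal{P}_\alpha)$ is finite when $\alpha > 2$; 
likewise, $R_n(\hat\pi; \mathcal{P}'_\alpha)$ is finite when $\alpha > 1$. The next theorem establishes lower 
bounds on $S_n(\pi; \mathcal{P}_\alpha)$ and $R_n(\pi; \mathcal{P}'_\alpha)$ when $\alpha \le 2$ and $\alpha \le 1$, respectively. 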

Theorem 3. For any policy $\pi$ and large enough $n$, one has 
$S_n(\pi; \mathcal{P}_\alpha) \ge K C_* \sigma^{\alpha} n^{1 - \alpha/2} \qquad \forall \alpha \in (0, 2],$
Theorem 3 shows that the forced sampling policy $\tilde\pi$ is rate optimal with 
respect to the inferior sampling rate when $\alpha \in (0, 2)$ and with respect to 
the regret when $\alpha \in (0, 1)$. At the same time, in the zone of small $\alpha$ the 
performance of the nearly-myopic policy $\hat\pi$ is worse than the minimax rate by 
a logarithmic factor (see Theorem 1). Note, however, that in the "boundary" 
cases $\alpha = 2$ and $\alpha = 1$, there is a logarithmic gap between the lower bounds 
of Theorem 3 and the upper bounds of Theorem 2. This raises the question 
whether the forced sampling policy is rate optimal with respect to the regret 
(inferior sampling rate) whenever $\alpha = 1$ ($\alpha = 2$). Next we show that, for a 
wide class of admissible policies, the performance of the forced sampling 
policy cannot be improved upon. 

Let $\Pi$ denote the class of policies $\pi = (\pi_t, t \ge 1)$ of the form $\pi_t = I\{X_t \ge \gamma_t\}$, 
where $\gamma_t$ is an $\mathcal{F}_{t-1}$-measurable random variable. We note that the 
class $\Pi$ is sufficiently rich and includes policies with forced sampling (set, 
e.g., $\gamma_t = \pm\infty$). We have the following result. 

Theorem 4. Let $\Theta \subset \mathbb{R}$ be a closed bounded interval; then for all $n$ 
large enough, one has 

$\inf_{\pi \in \Pi} S_n(\pi; \mathcal{P}_2(\Theta)) \ge K_1 \sigma^2 \ln n,$
$\inf_{\pi \in \Pi} R_n(\pi; \mathcal{P}'_1(\Theta)) \ge K_2 \sigma^2 \ln n,$

where $K_1$, $K_2$ are absolute constants. 

Thus, Theorem 4 establishes that in the "boundary" case ($\alpha = 1$ and 
$\alpha = 2$, for the regret and inferior sampling rate, resp.) the forced sampling 
policy $\tilde\pi$ cannot be improved in the sense of the order in the class of policies $\Pi$. 

3.4. Discussion. The upper bounds established in Theorems 1 and 2 
demonstrate that a finite regret can be achieved concurrent with an inferior 
sampling rate that grows to infinity. This is a rather obvious characteristic of 
the bandit problems with covariates: wrong arm "pulls" may incur a small, 
even negligible, loss in terms of rewards. In contrast, in traditional multi- 
armed bandit problems the regret and inferior sampling rate are identical 
up to a constant multiplier. 

Woodroofe (1979) and Sarkar (1991) establish asymptotic optimality of 
the myopic policies in the Bayesian setting. In contrast, the rate-optimal 
policies developed here are nonmyopic; we were not able to show that the 
myopic policy is rate optimal in our setting. We believe the explanation for 
this lies in the following assumptions made in the aforementioned papers: 
Woodroofe (1979), Conditions C1 and C2, and Sarkar (1991), Conditions 
5a and 5b. These assumptions impose that the prior distribution for $\theta$ is 
supported on an interval $\Theta$, while the covariate $X$ has a positive continuous 
probability density on the real line. Therefore, with positive probability $X$ 
takes values outside $\Theta$, and for such covariate values it is exactly known 
which arm is superior (in expectation). This assumption ensures that for 
every $t$ with large probability the myopic rule samples $O(t)$ times from the 
arm 1 (cf. Lemmas 2 and 4). Note that the conditions (2.3) and (2.4) are 
much less restrictive. With the extra assumptions made in Woodroofe (1979) 
and Sarkar (1991), one can establish optimality of a myopic policy, but 
it is worth pointing out that these assumptions significantly simplify the 
exploration-exploitation trade-off that underlies the design of "good" policies. 

4. Numerical results. This section describes the results of a small simu- 
lation experiment that illustrates behavior of the policies presented in Sec- 
tion 3, and compares them with the myopic policy. 

The conditional distribution of $Y$ given $X = x$ is assumed to be Gaussian 
with mean $x$ and variance 1; hence, $\theta = 0$. The following two setups 
are considered: (i) $X_t$ are i.i.d. random variables uniformly distributed on 
$[-1, 1]$; (ii) $X_t$ are i.i.d. random variables taking values $\pm 1$ with 
probability 1/2. The former corresponds to a case where $\alpha = 1$, and the latter 
to a case where $\alpha = \infty$. In each setup we compute the inferior sampling 
rate and the regret of the three policies (myopic, nearly myopic and the 
one involving forced sampling), when the horizon $n$ takes values in the 
set $\{250, 500, 750, 1000, 2000, 2500, 3000, 4000, 5000\}$. In our simulations the 
nearly-myopic policy is implemented with $\delta_t = \sqrt{\ln t}$, while the forced 
sampling policy uses $q = 1/12$; see the conditions of Theorem 2. For each $n$ we 
compute the inferior sampling rate and the regret, averaged over 500 runs. 
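
An experiment of this kind can be sketched as follows. The code is a hedged reconstruction of the setup described above, not the authors' simulation code: the function names, the seeding scheme, the string labels for policies and setups, and the convention that every policy pulls arm 1 at stage 1 are all assumptions.

```python
import math
import random


def simulate(policy: str, n: int, setup: str, rng: random.Random,
             q: float = 1.0 / 12.0):
    """One run with theta = 0 and sigma = 1; returns (wrong pulls, realized regret)."""
    theta = 0.0
    forced = {1}
    s = 2
    while math.floor(math.exp(q * s)) <= n:       # forced stages floor(exp(q*s))
        forced.add(math.floor(math.exp(q * s)))
        s += 1
    pulls, resid_sum, wrong, regret = 0, 0.0, 0, 0.0
    for t in range(1, n + 1):
        x = rng.uniform(-1.0, 1.0) if setup == "uniform" else rng.choice((-1.0, 1.0))
        if t == 1 or (policy == "forced" and t in forced):
            a = 1
        else:
            shift = 0.0
            if policy == "nearly_myopic":         # delta_{t-1} = sqrt(ln(t-1))
                shift = math.sqrt(math.log(t - 1)) / math.sqrt(pulls)
            a = 1 if x >= resid_sum / pulls - shift else 0
        if a != (1 if x >= theta else 0):         # compare with the oracle
            wrong += 1
            regret += abs(x - theta)
        if a == 1:                                # observe Y = X - theta + noise
            y = x - theta + rng.gauss(0.0, 1.0)
            resid_sum += x - y
            pulls += 1
    return wrong, regret


def average_performance(policy: str, n: int, setup: str,
                        runs: int = 500, seed: int = 0):
    """Average inferior count and regret of `policy` over independent runs."""
    rng = random.Random(seed)
    results = [simulate(policy, n, setup, rng) for _ in range(runs)]
    return (sum(w for w, _ in results) / runs,
            sum(r for _, r in results) / runs)
```

Calling `average_performance("forced", 1000, "uniform")` and the analogous calls for `"myopic"` and `"nearly_myopic"` gives estimates comparable in spirit to those reported below. Under the two-point setup every wrong pull costs exactly one unit, so the two averages coincide. 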






The results are summarized in Figures 1 and 2. Figure 1(a) shows the 
inferior sampling rate of the nearly-myopic, forced sampling and myopic 
policies averaged over 500 runs, while Figure 1(b) displays the corresponding 
averaged regret. It is clearly seen that when the covariates $X_t$ are uniformly 
distributed, the forced sampling policy has the smallest averaged inferior 
sampling rate and regret. The nearly-myopic policy also outperforms the 
myopic policy which has the largest average inferior sampling rate and regret. 
Figure 2 corresponds to setup (ii) where $\alpha = \infty$. Because $X_t$ are i.i.d. random 
variables taking the values $\pm 1$ and $\theta = 0$, the inferior sampling rate coincides 
with the regret. That is why in Figure 2 we present only the graph of the 
logarithm of the average regret. The numerical results show that the 
nearly-myopic policy is preferable under setup (ii), consistent with the theoretical 
results of Section 3. 

Even though the myopic policy appears to be inferior in comparison with 
the nearly-myopic and forced sampling policies, the results in Figures 1 
and 2 do not clarify the reasons for such behavior. Additional insight into 
performance of the three policies can be gained from the graphs in Figure 3. 
For $n = 2000$ and under conditions of setup (i), Figures 3(a) and (b) display 
the boxplots of the inferior sampling rate and the regret obtained in 500 
runs. It is clearly seen that the average performance of the myopic policy 
is badly affected by a large number of runs with poor performance. The 
nearly-myopic policy is the most stable over the different runs, though its 
average performance is worse than that of the forced sampling policy for 
this covariate distribution. 

Fig. 2. Setup (ii): $X_t$ are i.i.d. random variables taking values $\pm 1$ with probability 1/2. 
The logarithm of the regret averaged over 500 runs, plotted against the horizon. 

Fig. 3. Setup (i): $X_t$ are i.i.d. uniformly distributed on $[-1, 1]$. (a) Boxplot of the inferior 
sampling rate computed over 500 runs; (b) Boxplot of the regret computed over 500 runs. 

5. Proofs. 

5.1. Preliminary lemma. 

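Lemma 1. For any policy $\pi$, any measurable set $A$, and any $x > 0$, 

(5.1)  $P\{|\hat\theta_\pi(t) - \theta| > x, A\} \le 2 \Big[E \exp\Big\{-\frac{x^2}{2\sigma^2} T_\pi(t)\Big\} I\{|\hat\theta_\pi(t) - \theta| > x, A\}\Big]^{1/2}.$

In particular, 

(5.2)  $P\{|\hat\theta_\pi(t) - \theta| > x, T_\pi(t) \ge r\} \le 2 \exp\Big\{-\frac{x^2 r}{4\sigma^2}\Big\} \qquad \forall x, r > 0.$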

Remark 5. The proof shows that (5.1) holds when x is a positive ran- 
dom variable. 
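
As a sanity check on the shape of (5.2), consider the special case in which arm 1 is pulled at every stage, so that $T_\pi(t) = t$ and $\hat\theta_\pi(t) - \theta$ is Gaussian with mean zero and variance $\sigma^2 / t$. The sketch below compares the exact two-sided Gaussian tail with the bound $2\exp\{-x^2 r/(4\sigma^2)\}$, taking (5.2) in that form. The function names are hypothetical, and the check covers only this fixed-design case, not the sequential setting of the lemma.

```python
import math


def exact_two_sided_tail(x: float, r: int, sigma: float) -> float:
    """P{|Z| > x} for Z ~ N(0, sigma^2 / r), via the complementary error function."""
    return math.erfc(x * math.sqrt(r) / (sigma * math.sqrt(2.0)))


def exponential_bound(x: float, r: int, sigma: float) -> float:
    """Right-hand side 2 * exp(-x^2 * r / (4 * sigma^2))."""
    return 2.0 * math.exp(-x * x * r / (4.0 * sigma * sigma))
```

In this case the bound is loose by design: the exact tail is already dominated by $\exp\{-x^2 r/(2\sigma^2)\}$, and the weaker exponent is the price paid for handling a random number of pulls. 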

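Proof. The inequalities (5.1) and (5.2) follow immediately from results 
in de la Peña, Klass and Lai (2004) [see also de la Peña, Klass and Lai 
(2007) and Liptser and Spokoiny (2000)]. We provide a proof for completeness. 

Note that $\hat\theta_\pi(t) - \theta = -\big(\sum_{s=1}^{t} \pi_s\big)^{-1} \sum_{s=1}^{t} \varepsilon_s \pi_s$. Write, for brevity, 
$S = \{\theta - \hat\theta_\pi(t) > x, A\}$. Then for any $\lambda > 0$, 

(5.3)  $P\{S\} = E\, I\Big\{\exp\Big\{\lambda \sum_{s=1}^{t} \varepsilon_s \pi_s - \lambda x T_\pi(t)\Big\} > 1\Big\} I\{S\} \le E \exp\Big\{\lambda \sum_{s=1}^{t} \varepsilon_s \pi_s - \lambda x T_\pi(t)\Big\} I\{S\}.$

Let 

$M_t(\lambda) = \exp\Big\{\lambda \sum_{s=1}^{t} \varepsilon_s \pi_s - \frac{\lambda^2 \sigma^2}{2} T_\pi(t)\Big\};$

then $(M_t(\lambda), \mathcal{F}_t)$ is a martingale for any $\lambda$, and $E M_t(\lambda) \le 1$ for all $t$. 
Therefore, it follows from (5.3) that 

(5.4)  $P\{S\} \le E\Big[\exp\Big\{\lambda \sum_{s=1}^{t} \varepsilon_s \pi_s - \lambda^2 \sigma^2 T_\pi(t)\Big\} \exp\{(\lambda^2 \sigma^2 - \lambda x) T_\pi(t)\} I\{S\}\Big]$
$\qquad \le \sqrt{E M_t(2\lambda)} \, \big(E \exp\{2(\lambda^2 \sigma^2 - \lambda x) T_\pi(t)\} I\{S\}\big)^{1/2}$
$\qquad \le \Big[E \exp\Big\{-\frac{x^2}{2\sigma^2} T_\pi(t)\Big\} I\{S\}\Big]^{1/2},$

where the second inequality is obtained using the Cauchy-Schwarz inequality, 
and the third one is by minimization over $\lambda > 0$ (the minimum is attained at 
$\lambda = x/(2\sigma^2)$). Applying the bound (5.4) for the random variable 
$-(\theta - \hat\theta_\pi(t))$, we complete the proof of the lemma. □ 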

5.2. Proof of Theorem 1. We begin with the following lemma. 

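Lemma 2. Let (2.4) hold, and assume that the sequence $\delta = (\delta_t, t \ge 1)$ 
is nondecreasing. Then for any $z \in (0, 1/4]$ and $t \ge 2$, 

(5.5)  $P\{T_{\hat\pi}(t) < zpt\} \le \exp\Big\{-\frac{1}{2} p^2 z^2 t\Big\} + 4zt \exp\Big\{-\frac{\delta^2_{\lfloor (1 - 2z)t \rfloor}}{4\sigma^2}\Big\}.$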

Proof. Denote (s = T^^{s) Ej=i ^i^j- Then 

t 

r^(t) = l + 5^7r, 

s=2 

t 



= 1 + ^ > e^{s) - 6,T7y\s)} 

s=2 

t 

= l + Y. ^{^^+1 > ^ - - SsT7'^\s)} 

s=2 
t 

>i + Yl > e}i{\Cs\ < 6sT;'^\s)}. 

s=2 
Denote $w_s = I\{|\zeta_s| \le \delta_s T_{\hat\pi}^{-1/2}(s)\}$; then 

^Tnit) < zpt} < p|^I(X,+i > 6)ws <zpt-l^ 

< I{Xs+i > 0)ws < zpt -l,^Ws> 2zt\ 

U=2 s=2 J 

= :.h + J2- 

Let p' = Px{[0.,oo)}] it follows from (2.4) that p' >p. Note that Wg is 
.7^s_i-measurable; hence, (X]s=2[p' ~ -^{-'^s+i > ^}]w^s5-^s) is the martingale 
with bounded differences. Then by the Azuma-Hoeffding inequality [see, 
e.g., Cesa-Bianchi and Lugosi (2006)], 

{t t t \ 

J2[P - H^s+i > 0)]ws > p' ^ - {zpt -l),J2w,>2zt\ 
s=2 s=2 s=2 ) 

Now we bound J2. For this purpose we note that 

{i:n\Cs\ > SsT7'/\s)} > (1 - 2z)t\ C y {|C,| > 6,Tr'/\s)}. 

U=2 J s=[(l-2^)tj 

Therefore, 



-^2 = ^{l^/^l'^^l > ^sT7'/\s)} > (1 - 2z)t| 



< J2 ms\>SsT7'/\s)} 

s=lil-2z)t\ 

^2 E exp{-|,}<4.,exp{-%«}, 

where the second inequality follows from Lemma 1, and the third by 
monotonicity of $(\delta_t)$. Combining this inequality with the upper bound on $J_1$, we 
complete the proof. □ 
Now we turn to the proof of Theorem 1. 
Proof of Theorem 1. (i) First we prove (3.4). Define 
A = 



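Therefore, 

$T_{\inf}(n) = \sum_{t=1}^{n} I\{\hat\pi_t \ne \pi_t^*\} \le 1 + \sum_{t=1}^{n-1} I\{X_{t+1} \in D_t\}$

and 

(5.6)  $E[T_{\inf}(n)] \le 1 + \sum_{t=1}^{n-1} P\{X_{t+1} \in D_t\} \le 1 + \sum_{t=1}^{n-1} \big[P\{X_{t+1} \in D_t, T_{\hat\pi}(t) \ge pt/4\} + P\{T_{\hat\pi}(t) < pt/4\}\big].$

Applying Lemma 2 with $z = 1/4$, we have 

(5.7)  $P\{T_{\hat\pi}(t) < pt/4\} \le \exp\Big\{-\frac{p^2 t}{32}\Big\} + t \exp\Big\{-\frac{\delta^2_{\lfloor t/2 \rfloor}}{4\sigma^2}\Big\}.$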

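Furthermore, for any sequence $(\gamma_t)$ of positive random variables such that 
$\gamma_t$ is $\mathcal{F}_t$-measurable, one can write 

$P\{X_{t+1} \in D_t, T_{\hat\pi}(t) \ge pt/4\}$
$\quad = P\{X_{t+1} \in D_t, T_{\hat\pi}(t) \ge pt/4, |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| \le \gamma_t\}$
$\qquad + P\{X_{t+1} \in D_t, T_{\hat\pi}(t) \ge pt/4, |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| > \gamma_t\}$
$\quad =: P_1(t) + P_2(t).$

Setting $\gamma_t = 2\delta_t T_{\hat\pi}^{-1/2}(t)$ and using the definition of $D_t$ and (2.3), we obtain 

(5.8)  $P_1(t) \le P\Big\{|X_{t+1} - \theta| \le \frac{4\delta_t}{\sqrt{pt}}, T_{\hat\pi}(t) \ge pt/4\Big\} \le C_* \Big(\frac{4\delta_t}{\sqrt{pt}}\Big)^{\alpha},$

provided that $\delta_t \le x_0 \sqrt{pt}/4$. By Lemma 1, 

(5.9)  $P_2(t) \le P\{T_{\hat\pi}(t) \ge pt/4, |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| > 2\delta_t T_{\hat\pi}^{-1/2}(t)\}$
$\qquad \le P\{T_{\hat\pi}(t) \ge pt/4, |\hat\theta_{\hat\pi}(t) - \theta| > \delta_t T_{\hat\pi}^{-1/2}(t)\}$
$\qquad \le 2 \exp\Big\{-\frac{\delta_t^2}{4\sigma^2}\Big\}.$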
Now we set $\delta_t = 2\sigma\sqrt{3 \ln t}$; with this choice (5.9) and (5.7) imply that 
$P_2(t) \le 2t^{-3}$ and 

$P\{T_{\hat\pi}(t) < pt/4\} \le \exp\{-p^2 t/32\} + 8t^{-2} \qquad \forall t \ge 2.$

Let $t_0 = t_0(p, \sigma, x_0) := \min\{t : x_0 \sqrt{pt}/4 \ge 2\sigma\sqrt{3 \ln t}\}$; then it follows from 
(5.8) that 

$P_1(t) \le C_* \Big(8\sqrt{3}\,\sigma\sqrt{\frac{\ln t}{pt}}\Big)^{\alpha} \qquad \forall t \ge t_0.$

It is easily seen that $t_0 \le c_1 \sigma^2 (p x_0^2)^{-1} \big|\ln[\sigma^2 (p x_0^2)^{-1}]\big|$ for some absolute 
constant $c_1$. Combining these inequalities with (5.6), we obtain 



n-l / 

E[rinf(n)] <(toV2) + C, J2 Sv^fj 



t=toV2 




Int 
pt 



n— 2 n— 1 

+ 8 ^ t-2 + 2 ^ J2 exp{-pV32}. 

i=ioV2 t=toV2 t=toV2 

This completes the proof of (3.4). 

Now we prove (3.5). Assume that $\alpha > 2$ and let $t_\alpha$ be given by (3.3). 
Clearly, 

$t_\alpha \le c_2 \big[(8\sqrt{3}\,\sigma p^{-1/2}) \sqrt{\ln(8\sqrt{3}\,\sigma p^{-1/2})}\big]^{4\alpha/(\alpha - 2)}$

for some absolute constant $c_2$. We can write 

n 



4a 
a - 2 



2a/(a-2) 



t=ta 



Each summand on the right-hand side of the above formula is bounded from 
above exactly as before. This leads to the following bound: 

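\[
\begin{aligned}
\mathbb{E}[T_{\inf}(n)] &\le (t_0 \vee 2 \vee t_\alpha) + C_*(8\sqrt{3}\sigma)^{\alpha} \sum_{t=t_0\vee 2\vee t_\alpha}^{n-1} \Big(\frac{\ln t}{pt}\Big)^{\alpha/2} \\
&\quad + 8\sum_{t=t_0\vee 2\vee t_\alpha}^{n-1} t^{-2} + 2\sum_{t=t_0\vee 2\vee t_\alpha}^{n-1} t^{-6} + \sum_{t=t_0\vee 2\vee t_\alpha}^{n-1} \exp\{-p^2 t/32\}.
\end{aligned}
\tag{5.10}
\]
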
By definition of t^, (8^30-^111*7^)° < p/'^"^/^ fo^. t > hence, 
V^Y«/^ /l^V "^V8V3a\" (lntr/^ 1 

< V- ^-(a+2)/4 < 4n 

~ 4^ ~ a - 2 ■ 

t — to- 

This bound along with (5.10) leads to (3.5).

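(ii) The proof of the second statement goes along the same lines.
We have

\[
\begin{aligned}
\sum_{t=1}^{n} |X_t - \theta|\, I\{\hat\pi_t \ne \pi_t^*\}
&\le |X_1 - \theta| + \sum_{t=1}^{n-1} |X_{t+1} - \theta|\, I\{X_{t+1} \in D_t\} \\
&\le |X_1 - \theta| + \sum_{t=1}^{n-1} |X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\hat\pi}(t) \ge pt/4\} \\
&\quad + \sum_{t=1}^{n-1} |X_{t+1} - \theta|\, I\{T_{\hat\pi}(t) < pt/4\} \\
&=: |X_1 - \theta| + J_1 + J_2.
\end{aligned}
\]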

Since K\Xt — 0\< fj, for Px,y £ 'P'a^ ^-^id ^t+i is independent of T^(t), we 
have by (5.7) 



exp< > + t exp 



32 J "^l 4(j2 



£=1 

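Furthermore,

\[
\begin{aligned}
\mathbb{E}[J_1] &= \sum_{t=1}^{n-1} \mathbb{E}|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\hat\pi}(t) \ge pt/4,\ |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| \le \gamma_t\} \\
&\quad + \sum_{t=1}^{n-1} \mathbb{E}|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\hat\pi}(t) \ge pt/4,\ |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| > \gamma_t\} \\
&=: \sum_{t=1}^{n-1} E_1(t) + \sum_{t=1}^{n-1} E_2(t).
\end{aligned}
\]

Setting as before $\gamma_t = 2\delta_t T_{\hat\pi}^{-1/2}(t)$ and using (5.8), we obtain

\[
E_1(t) \le \frac{4\delta_t}{\sqrt{pt}}\, \mathbb{P}\Big\{|X_{t+1} - \theta| \le \frac{4\delta_t}{\sqrt{pt}},\ T_{\hat\pi}(t) \ge pt/4\Big\} \le C_*\Big(\frac{4\delta_t}{\sqrt{pt}}\Big)^{\alpha+1},
\]

provided that $\delta_t \le x_0\sqrt{pt}/4$. In addition,

\[
\begin{aligned}
E_2(t) &= \mathbb{E}|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\hat\pi}(t) \ge pt/4,\ |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| > \gamma_t\} \\
&\le \mathbb{E}|X_{t+1} - \theta|\, I\{T_{\hat\pi}(t) \ge pt/4,\ |\hat\theta_{\hat\pi}(t) - \delta_t T_{\hat\pi}^{-1/2}(t) - \theta| > \gamma_t\} \\
&\le 2\mu\exp\Big\{-\frac{\delta_t^2}{2\sigma^2}\Big\}.
\end{aligned}
\]

Combining these inequalities, we come to (3.6). The bound (3.7) is obtained
using the same reasoning as in the proof of (3.5).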

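5.3. Proof of Theorem 2. The next result follows from Lemma 1 by setting
$A = \Omega$ and taking into account that $T_{\tilde\pi}(t) \ge N(t)$, $\forall t$.

Lemma 3. Under policy $\tilde\pi$, for any $x > 0$ and any $t \ge 1$, one has

\[
\mathbb{P}\{|\hat\theta_{\tilde\pi}(t) - \theta| > x\} \le 2\exp\Big\{-\frac{x^2}{2\sigma^2} N(t)\Big\}.
\]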

In particular, it follows from (3.12) and Lemma 3 that 

(5.11) P{|0^(t)-e|>x}<2exp|^j(^--^j yt>u, 

where u is given in (3.10). 

Lemma 4. Let Px,y G 'Pa o.nd pi :=p — C^Xq,- then for any 2; G (0, 1/4] 
and all t> u, 

^{Ti{t)<zpit} 

<exp|-ip2^2(^_^(^))| 

+ 2 expj ^ [1 + ln(z. + 1)] |ti--o/(49-^) . 

Proof. Fix t>v. Denote Ci = jrj^ T^]=i ^j^j- Then 

t 

T^{t) = N{t) + Y,nsI{seT'^} 
t



Nit) + I{^s+i > Oiis)}I{s G T^} 

s=l 

t 

Nit) + i{Xs+i > e - Cs}I{s G T^} 



s=l 

t 



> Nit) + J2 HXs+1 > + xo}I{\Cs\ < xo}I{s G T'}. 



s=l 



Denote Ws = I{\Cs \ < xo}I{s gT''}. Then 
¥{T^it) < zpit} 

< ^^J2HXs+l > + xo}ws < zplit - iV(t))| 

< >e + xo}ws < zplit - Nit)),J2 Ws > 2zit - Nit))\ 

ls=l s=l J 

+ r\^Ws<2z{t-Nit))^ 
=:Jiit) + J2{t). 

Let $p_1' = \mathbb{P}\{X_{s+1} > \theta + x_0\}$; it follows from the definition of $\mathcal{P}_\alpha$ [cf. (2.5)] that
$p_1' \ge p_1 > 0$. Note that $w_s$ is $\mathcal{F}_{s-1}$-measurable, and $\big(\sum_{i=1}^{s}[p_1' - I\{X_{i+1} > \theta + x_0\}]w_i,\ \mathcal{F}_s\big)$
is a martingale with bounded differences. Then by the Azuma-Hoeffding
inequality [see, e.g., Cesa-Bianchi and Lugosi (2006)],

{t t 
^[p'l - i{Xs+i > e + xo}]ws > p'l E ^« - ^Pi(^ - ^(*))' 
s=l s=l 

Y,ws>2zit-Nit))\ 

s=l J 

< - ^{^-+1 ^ ^ + ''o}]ws > zpiit - 7V(t))| 

<exp|-lzV(i-A^(i))}- 
Now we bound J2(t) as follows: 



J2it)=Fij2ws<2zit-Nit))^ 
<



^|E^{|CsI < ^o}I{s G T'} < 2z{t - iV(i))| 
"|E^{|CsI > ^o}I{s G T'} > (1 - 2z){t - N{t))^ 

4 U {ICs|>Xo}| 

ls=\(t~N(t))(l-2z)\ ) 



<it-N{t)) max P{|G|>2;o} 

t~N{t)<s<t 

< 2exp|^[l + q-^ln{u + l)]|ti-^o/(4g-^), 

where the last inequality follows from (5.11). Combining this inequality with
the upper bound on $J_1(t)$ completes the proof. □

Proof of Theorem 2. 1°. First we prove (3.15). By the premise of the
theorem,

\[
x_0^2/(12\sigma^2) \ge q. \tag{5.12}
\]

Put $\mathcal{T}_t^c = \{1 \le s \le t : s \notin \mathcal{T}\}$, and define $D_t = [\hat\theta_{\tilde\pi}(t) \wedge \theta,\ \hat\theta_{\tilde\pi}(t) \vee \theta]$. Then

n 

By choice of the "forced" sampling sequence T in view of (3.11), we have 
that 

E[rinf (n)] < 1 + - ln(n + 1) + ^ P{Xt+i G A} 

= l + lln(n+l)+ V P{Xt+i G A,Ts(t)>pit/4} 
o - 

(5.13) 

+ 5^ P{Xi+i G A,T^(t) <Pit/4} 

<l + -ln(n + l) + z.o+ V Pi(t)+ V P2(i), 
where $\nu_0$ is defined in (3.14). Applying Lemma 4 with $z = 1/4$, we have

P2{t)<nTir{t)<Plt/4:} 



< 
exp|-||(t-iV(t))} + 2exp|^[l + g-Mn(^ + l)]|ti--§/(4'''^^). 

< exp{-p2t/64} + 2 expj [1 + q'^ \n{u + 1)] |^-^ 

where the last inequality follows from (5.12) and the fact that $t - N(t) \ge t/2$
for $t \ge \nu_0$. Hence, we get that

(5.14) 

+ ^exp|^[l + g-iln(^ + l)]|. 

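Now, turning to $P_1(t)$, we have

\[
\begin{aligned}
P_1(t) &= \mathbb{P}\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4\} \\
&= \mathbb{P}\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| \le x_0\} \\
&\quad + \mathbb{P}\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > x_0\} \\
&=: J_1(t) + J_2(t).
\end{aligned}
\]

We first bound $J_2(t)$. Using Lemma 1, we have

\[
J_2(t) \le \mathbb{P}\{T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > x_0\} \le 2\exp\Big\{-\frac{x_0^2 p_1 t}{8\sigma^2}\Big\}
\]

and, therefore,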



(5.15) j:,,«,<^2exp{-^} = ^-^ 



For Ji(t), we proceed as follows: 



Api/{^a^)y 



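\[
\begin{aligned}
J_1(t) &= \sum_{k=0}^{\infty} \mathbb{P}\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4,\ 2^{-k-1}x_0 < |\hat\theta_{\tilde\pi}(t) - \theta| \le 2^{-k}x_0\} \\
&= \sum_{k=0}^{\infty} \mathbb{E}\big[I\{T_{\tilde\pi}(t) \ge p_1 t/4,\ 2^{-k-1}x_0 < |\hat\theta_{\tilde\pi}(t) - \theta| \le 2^{-k}x_0\} \times \mathbb{P}\{X_{t+1} \in D_t \mid \mathcal{F}_t\}\big] \\
&\overset{(a)}{\le} \sum_{k=0}^{\infty} C_*[2^{-k}x_0]^{\alpha}\, \mathbb{P}\{T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > 2^{-k-1}x_0\} \\
&\overset{(b)}{\le} \sum_{k=0}^{\infty} C_*[2^{-k}x_0]^{\alpha}\, 2\exp\Big\{-\frac{2^{-2k}x_0^2 p_1 t}{32\sigma^2}\Big\} \\
&= 2C_* x_0^{\alpha} \sum_{k=0}^{\infty} 2^{-\alpha k}\exp\Big\{-\frac{2^{-2k}x_0^2 p_1 t}{32\sigma^2}\Big\},
\end{aligned}
\tag{5.16}
\]

where (a) follows from condition (2.3) and (b) follows from Lemma 1.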
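For $a, b > 0$, set

\[
S(a,b) := \sum_{k=0}^{\infty} 2^{-ak}\exp\{-b2^{-2k}\}, \qquad
I(a,b) := \int_0^{\infty} 2^{-ay}\exp\{-b2^{-2y}\}\,dy.
\tag{5.17}
\]

Note that the integrand above has a unique (global) maximum at $y_* = -(1/2)\log_2(a/2b)$, provided that $a < 2b$. Put $k_* := \lfloor y_* \rfloor$ and write

\[
S(a,b) = \sum_{k=0}^{k_*} 2^{-ak}\exp\{-b2^{-2k}\} + \sum_{k=k_*+1}^{\infty} 2^{-ak}\exp\{-b2^{-2k}\} =: S_1(a,b) + S_2(a,b).
\]

It follows that

\[
S_2(a,b) \le \frac{2^{-a(k_*+1)}}{1 - 2^{-a}} \le b^{-a/2}\,\frac{(a/2)^{a/2}}{1 - 2^{-a}}.
\]

Since the integrand in (5.17) is monotone increasing on $[0, y_*)$, we have that

\[
S_1(a,b) \le \int_0^{k_*+1} 2^{-ay}\exp\{-b2^{-2y}\}\,dy \le I(a,b)
= \frac{1}{\ln 2}\int_0^1 z^{a-1}\exp\{-bz^2\}\,dz \le b^{-a/2}\,\frac{\Gamma(a/2)}{2\ln 2},
\]

where $\Gamma(\cdot)$ denotes the gamma function. Thus, we have shown that for all
$0 < a < 2b$ one has

\[
S(a,b) \le b^{-a/2}\Big[\frac{(a/2)^{a/2}}{1 - 2^{-a}} + \frac{\Gamma(a/2)}{2\ln 2}\Big].
\]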
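
The bound on $S(a,b)$ can be checked numerically. The following Python sketch evaluates a truncated version of the series in (5.17) and compares it with the right-hand side $b^{-a/2}[(a/2)^{a/2}/(1-2^{-a}) + \Gamma(a/2)/(2\ln 2)]$. The function names, the truncation length, and the parameter grid in the usage are illustrative choices and do not appear in the paper.

```python
import math


def series_S(a, b, terms=200):
    """Truncated series S(a, b) = sum_{k>=0} 2^(-a k) * exp(-b * 2^(-2 k)).

    All terms are nonnegative, so the truncated sum never exceeds the full sum.
    """
    return sum(2.0 ** (-a * k) * math.exp(-b * 2.0 ** (-2 * k)) for k in range(terms))


def bound_S(a, b):
    """Right-hand side of the bound on S(a, b), stated for 0 < a < 2b."""
    if not 0.0 < a < 2.0 * b:
        raise ValueError("the bound is stated only for 0 < a < 2b")
    geometric_part = (a / 2.0) ** (a / 2.0) / (1.0 - 2.0 ** (-a))
    gamma_part = math.gamma(a / 2.0) / (2.0 * math.log(2.0))
    return b ** (-a / 2.0) * (geometric_part + gamma_part)


if __name__ == "__main__":
    for a in (0.1, 0.5, 1.0, 1.9, 3.0):
        for b in (1.0, 8.0, 100.0):
            if a < 2.0 * b:
                print(a, b, series_S(a, b), bound_S(a, b))
```

In the proof the bound is used with $a = \alpha$ and $b = x_0^2 t p_1/(32\sigma^2)$, so the condition $a < 2b$ becomes a lower limit on $t$.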

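Now we apply this result with $b = x_0^2 t p_1/(32\sigma^2)$ in order to bound $J_1(t)$
[see (5.16)]. In particular, for any $t > 16\sigma^2\alpha/(x_0^2 p_1)$, we have

\[
J_1(t) \le 2C_*\Big(\frac{32\sigma^2}{t p_1}\Big)^{\alpha/2}\Big[\frac{(\alpha/2)^{\alpha/2}}{1 - 2^{-\alpha}} + \frac{\Gamma(\alpha/2)}{2\ln 2}\Big]
\]
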



and, hence, 



(5.18) y Ji(t)<l^ + 6C, 



(a/2)«/2 ^ r(a/2) 



1-2- 



21n2 



E 



32^2 \°/2 
Combining (5.18) with (5.15), (5.14) and (5.13), we come to (3.15).
Now consider the case of $\alpha > 2$. Here we have

E Ji(t)<2axo"^E2-"^-p|-m^ 



2" «fc 



x2pi/(32a2)}- 

If $x_0^2 p_1/(32\sigma^2) \ge 1$, then for all $k > k_0 = (2\ln 2)^{-1}\ln(x_0^2 p_1/(32\sigma^2))$ we have
that $2^{-2k}x_0^2 p_1/(32\sigma^2) \le 1$. The last inequality holds for all $k$ if $x_0^2 p_1/(32\sigma^2) < 1$.
In both cases

,5Ql-exp{-2-2fcx2pi/(32a2)} 
- 1 - 

fc=0 

oo r)—ak 

+ E 



,= L.oJ+i 2-^'^^§Pi/(32ct2) - l/2[2-2fc^2p^/(32a2)]2 

< ^ + ^ V 2-("-2)^ 

-(l-e-i)(l-2-) 



<7^ TTT^ ^ + 



/32a^\"/("-2) 



(l-e-i)(l-2-") l-22-«Vxgpi 

where in the second inequality we took into account that $2^{-2k}x_0^2 p_1/(32\sigma^2) \le 1$
for $k > k_0$. Therefore, if $\alpha > 2$, then

(5.19) g^,,w<2C.4{^(5^) 

Combining (5.19) with (5.15), (5.14) and (5.13), we come to (3.16). 

2°. The proof of the second statement proceeds using almost identical
arguments. We have

Y,\Xt-e\I{iTt^nn 

< J2 \^t+i - e\i{Xt+i G A} + E 1^* - ^1 

< J2 \^t+i - 0\I{Xt+i e Dt,T^{t)>pit/A} 
t&f„ 
+ \Xt+i-e\I{Ti{t)<pit/A} 

+ ( E \Xt-e\+Y.\^t- 

=: Ji(n) + J2(n) + J3(n). 

Because $P_{X,Y} \in \mathcal{P}'_\alpha(\mu)$, $\mathbb{E}|X_t - \theta| \le \mu$. Then, using the properties of the
"forced" sampling sequence $\mathcal{T}$, we have that



E[J3(n)]<^ 



l + i/o + -ln(n + l) 



Now, since $X_{t+1}$ is independent of $T_{\tilde\pi}(t)$, arguing in the same way as in
(5.14), we have



E[J2(n)] <^ 
Furthermore, 



+ ^exp^-^[l + g-iln(z. + l)] 



1 - exp{-pl/64} 3 "^14(72 
nJi{n)] = J2^\Xt+i-9\I{Xt+ieDt,mt)>pit/A,\e^{t)-e\<xo} 

+ J2 nxt+i - e\i{Xt+i G Dt,n{t) > Pit/ A, le^it) -e\> xo} 

n— 1 71—1 



t=l 



t=l 



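Using Lemma 1 and the independence of $X_{t+1}$ from $T_{\tilde\pi}(t), \hat\theta_{\tilde\pi}(t)$, we bound
the second term as follows:

\[
\begin{aligned}
E_2(t) &= \mathbb{E}|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > x_0\} \\
&\le \mathbb{E}|X_{t+1} - \theta|\, I\{T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > x_0\} \\
&\le 2\mu\exp\Big\{-\frac{x_0^2 p_1 t}{8\sigma^2}\Big\}.
\end{aligned}
\]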

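Now, for the first term, write

\[
\begin{aligned}
E_1(t) &= \sum_{k=0}^{\infty} \mathbb{E}\big[|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t,\ T_{\tilde\pi}(t) \ge p_1 t/4,\ 2^{-k-1}x_0 < |\hat\theta_{\tilde\pi}(t) - \theta| \le 2^{-k}x_0\}\big] \\
&= \sum_{k=0}^{\infty} \mathbb{E}\big[I\{T_{\tilde\pi}(t) \ge p_1 t/4,\ 2^{-k-1}x_0 < |\hat\theta_{\tilde\pi}(t) - \theta| \le 2^{-k}x_0\} \times \mathbb{E}\{|X_{t+1} - \theta|\, I\{X_{t+1} \in D_t\} \mid \mathcal{F}_t\}\big] \\
&\overset{(a)}{\le} \sum_{k=0}^{\infty} C_*[2^{-k}x_0]^{\alpha+1}\, \mathbb{P}\{T_{\tilde\pi}(t) \ge p_1 t/4,\ |\hat\theta_{\tilde\pi}(t) - \theta| > 2^{-k-1}x_0\} \\
&\overset{(b)}{\le} 2C_* x_0^{\alpha+1} \sum_{k=0}^{\infty} 2^{-(\alpha+1)k}\exp\Big\{-\frac{2^{-2k}x_0^2 p_1 t}{32\sigma^2}\Big\} \\
&= 2C_* x_0^{\alpha+1} S(\alpha+1, b),
\end{aligned}
\]

where (a) follows from condition (2.3), (b) follows from Lemma 1, and,
with the notation used earlier in part 1°, the function $S(\cdot,\cdot)$ is defined in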
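(5.17) and $b := x_0^2 p_1 t/(32\sigma^2)$. Using the bound on $S(a,b)$ derived earlier,
and making the substitution $\alpha \to (\alpha + 1)$, we get, for any $t > 16\sigma^2(\alpha+1)/(x_0^2 p_1)$,

\[
E_1(t) \le 2C_*\Big(\frac{32\sigma^2}{t p_1}\Big)^{(\alpha+1)/2}\Big[\frac{((\alpha+1)/2)^{(\alpha+1)/2}}{1 - 2^{-(\alpha+1)}} + \frac{\Gamma((\alpha+1)/2)}{2\ln 2}\Big].
\]
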

Summing over t and using the bounds derived above on E2{t) together with 
the bounds established already on E[Ji(n)] and E[J2(n)], we obtain (3.17). 

If a > 1, then arguing as in the proof of (3.15) we arrive at the result 
stated in (3.18). 

5.4. Proof of Theorem 3. The proof relies on the following lemma. 
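Lemma 5. Let (2.3) hold; then for any policy $\pi$, one has

\[
R_n(\pi, \pi^*) \ge \frac{[S_n(\pi, \pi^*)]^{1+1/\alpha}\, n^{-1/\alpha}}{2\max\{(1/x_0),\ (2C_*)^{1/\alpha}\}}.
\]

Proof. Write, for brevity, $d_t(\pi, \pi^*) = \mathbb{P}\{\pi_t \ne \pi_t^*\}$. In order to underline
dependence of $\pi_t$ on the observations $\mathcal{Y}_{t-1} = (X_1, \pi_1, \pi_1 Y_1, \ldots, \pi_{t-1}, \pi_{t-1}Y_{t-1})$
and on the covariate value $X_t$, we will write $\pi_t = \pi_t(\mathcal{Y}_{t-1}; X_t)$. We write
also $\pi_t^* = \pi_t^*(X_t)$.

Let $\eta_t$ be a sequence of positive real numbers such that $\eta_t \le x_0$, $\forall t$; then

\[
\begin{aligned}
R_n(\pi, \pi^*) &\ge \sum_{t=1}^{n} \mathbb{E}|X_t - \theta|\, I\{\pi_t(\mathcal{Y}_{t-1}; X_t) \ne \pi_t^*(X_t)\}\, I\{|X_t - \theta| > \eta_t\} \\
&\ge \sum_{t=1}^{n} \eta_t \int_{\{x : |x - \theta| > \eta_t\}} \mathbb{P}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\}\, P_X(dx) \\
&= \sum_{t=1}^{n} \eta_t \Big[d_t(\pi, \pi^*) - \int_{\{x : |x - \theta| \le \eta_t\}} \mathbb{P}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\}\, P_X(dx)\Big] \\
&\ge \sum_{t=1}^{n} \eta_t \big[d_t(\pi, \pi^*) - P_X\{[\theta - \eta_t, \theta + \eta_t]\}\big] \\
&\ge \sum_{t=1}^{n} \eta_t \big[d_t(\pi, \pi^*) - C_*\eta_t^{\alpha}\big],
\end{aligned}
\tag{5.20}
\]

where the last inequality follows from (2.3).

Now we set $\kappa = \max\{2, 1/(C_* x_0^{\alpha})\}$, and $\eta_t = [d_t(\pi, \pi^*)/(\kappa C_*)]^{1/\alpha}$. With
this choice

\[
\eta_t \le x_0\, d_t^{1/\alpha}(\pi, \pi^*) \le x_0 \quad \forall t \qquad \text{and} \qquad C_*\eta_t^{\alpha} \le \tfrac{1}{2} d_t(\pi, \pi^*).
\]

Then it follows from (5.20) that

\[
\begin{aligned}
R_n(\pi, \pi^*) &\ge \frac{1}{2}\sum_{t=1}^{n} \eta_t\, d_t(\pi, \pi^*) = \frac{1}{2}\Big(\frac{1}{\kappa C_*}\Big)^{1/\alpha} \sum_{t=1}^{n} [d_t(\pi, \pi^*)]^{1+1/\alpha} \\
&\ge \frac{1}{2}\Big(\frac{1}{\kappa C_*}\Big)^{1/\alpha} n^{-1/\alpha}\, [S_n(\pi, \pi^*)]^{1+1/\alpha},
\end{aligned}
\]

where the last line follows from Jensen's inequality. □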
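
The last two steps of this proof rest on two elementary facts: the power-mean form of Jensen's inequality, $\sum_t d_t^{1+1/\alpha} \ge n^{-1/\alpha}(\sum_t d_t)^{1+1/\alpha}$, and the identity $(\kappa C_*)^{1/\alpha} = \max\{1/x_0, (2C_*)^{1/\alpha}\}$ for $\kappa = \max\{2, 1/(C_* x_0^{\alpha})\}$. A minimal Python sketch checking both numerically follows; the function names and the sample values are hypothetical and serve only as illustration.

```python
def jensen_gap(d, alpha):
    """sum_t d_t^(1+1/alpha) minus n^(-1/alpha) * (sum_t d_t)^(1+1/alpha).

    Nonnegative by Jensen's inequality, since x -> x^(1+1/alpha) is convex on [0, inf).
    """
    n = len(d)
    power = 1.0 + 1.0 / alpha
    lhs = sum(x ** power for x in d)
    rhs = n ** (-1.0 / alpha) * sum(d) ** power
    return lhs - rhs


def constant_via_kappa(alpha, x0, c_star):
    """(1/2) * (1 / (kappa * C_*))^(1/alpha) with kappa = max{2, 1/(C_* x0^alpha)}."""
    kappa = max(2.0, 1.0 / (c_star * x0 ** alpha))
    return 0.5 * (1.0 / (kappa * c_star)) ** (1.0 / alpha)


def constant_in_lemma(alpha, x0, c_star):
    """1 / (2 * max{1/x0, (2 C_*)^(1/alpha)}), the constant in the lemma statement."""
    return 1.0 / (2.0 * max(1.0 / x0, (2.0 * c_star) ** (1.0 / alpha)))


if __name__ == "__main__":
    probs = [0.1, 0.5, 0.9, 0.3]  # hypothetical values of d_t
    print(jensen_gap(probs, alpha=1.5))
    print(constant_via_kappa(1.5, 0.5, 2.0), constant_in_lemma(1.5, 0.5, 2.0))
```

Equality in the Jensen step holds when all $d_t$ coincide, which is why the gap is zero (up to rounding) for a constant sequence.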

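Proof of Theorem 3. Fix $\delta > 0$, and let $\theta^{(0)} = 0$ and $\theta^{(1)} = \delta$. Note that
when $\theta = \theta^{(0)}$ ($\theta = \theta^{(1)}$) it is preferable to sample from the arm 1 when $x > 0$
($x > \delta$). Thus, $\pi^*(\theta^{(0)}, x) \ne \pi^*(\theta^{(1)}, x)$ only when $x \in (0, \delta)$.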

Choose the probability density fx,s of X so that 

Suppose also that $\delta$ is small enough so that $C_* x_0^{\alpha} + C_*(\delta/2)^{\alpha} < 1$; then
$f_{X,\delta}$ can be indeed continued outside the interval $[-x_0, x_0 + \delta]$ so that it
is a probability density. Clearly, the joint distributions $P_{X,Y}$ of $X$ and $Y$,
corresponding to $\theta = \theta^{(0)}$ ($\theta = \theta^{(1)}$) and $f_{X,\delta}$, belong to $\mathcal{P}_\alpha$.
Therefore, we have

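\[
\begin{aligned}
S_n(\pi; \mathcal{P}_\alpha) &\ge \sup_{\theta \in \{\theta^{(0)}, \theta^{(1)}\}} \sum_{t=1}^{n} \mathbb{P}_\theta\{\pi_t \ne \pi_t^*\} \\
&\ge \frac{1}{2}\sum_{t=1}^{n} \int_0^{\delta} \big[\mathbb{P}_{\theta^{(0)}}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\} \\
&\qquad\qquad + \mathbb{P}_{\theta^{(1)}}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\}\big] f_{X,\delta}(x)\,dx,
\end{aligned}
\]

where we have used the fact that $X_t$ is independent of $\mathcal{Y}_{t-1}$. Here and in
the sequel, $\mathbb{P}_{\theta^{(i)}}$ denotes the probability measure w.r.t. the distribution of
observations $\mathcal{Y}_{t-1}$ when $\theta = \theta^{(i)}$.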

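Now for fixed $t$ consider the problem of testing the hypothesis $H_0 : \theta = \theta^{(0)}$
versus $H_1 : \theta = \theta^{(1)}$ from the observations $\mathcal{Y}_{t-1}$ collected under the policy $\pi$.
Consider the following test: given observations $\mathcal{Y}_{t-1}$, the statistic $\pi_t(\mathcal{Y}_{t-1}; x)$
is computed for given $x \in (0, \delta)$, and it is compared with $\pi_t^*(\theta^{(0)}, x)$ and
$\pi_t^*(\theta^{(1)}, x)$. The hypothesis $H_0$ is rejected when $\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(\theta^{(0)}, x)$. Because
$\pi_t^*(\theta^{(0)}, x) \ne \pi_t^*(\theta^{(1)}, x)$, $\forall x \in (0, \delta)$, the expression under the integral
sign in the last displayed formula above represents the sum of the error probabilities
of the described test. Using well-known inequalities on error probabilities
in testing problems [see, e.g., Devroye (1987) or Tsybakov (2004b)],
we obtain that for any fixed $x \in (0, \delta)$

\[
\begin{aligned}
&\mathbb{P}_{\theta^{(0)}}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\} + \mathbb{P}_{\theta^{(1)}}\{\pi_t(\mathcal{Y}_{t-1}; x) \ne \pi_t^*(x) \mid X_t = x\} \\
&\qquad \ge \tfrac{1}{4}\exp\{-\mathcal{K}(\mathbb{P}_{\theta^{(0)}}(\mathcal{Y}_{t-1}), \mathbb{P}_{\theta^{(1)}}(\mathcal{Y}_{t-1}))\},
\end{aligned}
\]

where $\mathcal{K}(\cdot, \cdot)$ is the Kullback-Leibler divergence between distributions of the
observations $\mathcal{Y}_{t-1}$ under $H_0$ and $H_1$. A straightforward calculation shows
that

$\mathcal{K}(\mathbb{P}_{\theta^{(0)}}(\mathcal{Y}_{t-1}), \mathbb{P}_{\theta^{(1)}}(\mathcal{Y}_{t-1}))$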

( I t~i , t-i >| 

\ s=l s=l ) 

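\[
\le \frac{\delta^2}{2\sigma^2}\, \mathbb{E}_{\theta^{(0)}}[T_\pi(t-1)],
\]

so that

\[
\begin{aligned}
S_n(\pi; \mathcal{P}_\alpha) &\ge \frac{1}{8}\sum_{t=1}^{n} \exp\{-\mathcal{K}(\mathbb{P}_{\theta^{(0)}}(\mathcal{Y}_{t-1}), \mathbb{P}_{\theta^{(1)}}(\mathcal{Y}_{t-1}))\} \int_0^{\delta} f_{X,\delta}(x)\,dx \\
&\ge \frac{1}{4} C_*(\delta/2)^{\alpha} \sum_{t=1}^{n} \exp\Big\{-\frac{\delta^2}{2\sigma^2}\, \mathbb{E}_{\theta^{(0)}} T_\pi(t-1)\Big\} \\
&\ge \frac{1}{4} C_*(\delta/2)^{\alpha}\, n \exp\Big\{-\frac{\delta^2 n}{2\sigma^2}\Big\}.
\end{aligned}
\]

Maximizing the RHS with respect to $\delta$, we set $\delta = \delta_n = \sqrt{\alpha}\,\sigma n^{-1/2}$. This
yields

\[
S_n(\pi; \mathcal{P}_\alpha) \ge \frac{C_*}{4}\Big(\frac{\alpha}{4e}\Big)^{\alpha/2} \sigma^{\alpha}\, n^{1-\alpha/2},
\]

as was claimed. The lower bound on $R_n(\pi; \mathcal{P}_\alpha)$ follows from Lemma 5.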
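
The choice $\delta_n = \sqrt{\alpha}\,\sigma n^{-1/2}$ is the stationary point of $\delta \mapsto \delta^{\alpha}\exp\{-\delta^2 n/(2\sigma^2)\}$: setting the derivative of its logarithm, $\alpha/\delta - \delta n/\sigma^2$, to zero gives $\delta^2 = \alpha\sigma^2/n$. The Python sketch below checks this numerically. It assumes the right-hand side depends on $\delta$ only through that factor; the function names and the parameter values in the usage are illustrative.

```python
import math


def objective(delta, n, sigma, alpha):
    """delta-dependent factor of the lower bound: delta^alpha * exp(-delta^2 n / (2 sigma^2))."""
    return delta ** alpha * math.exp(-delta ** 2 * n / (2.0 * sigma ** 2))


def delta_n(n, sigma, alpha):
    """Maximizer sqrt(alpha) * sigma / sqrt(n) used in the proof."""
    return math.sqrt(alpha) * sigma / math.sqrt(n)


def objective_at_maximum(n, sigma, alpha):
    """Closed-form maximum: (alpha / e)^(alpha / 2) * sigma^alpha * n^(-alpha / 2)."""
    return (alpha / math.e) ** (alpha / 2.0) * sigma ** alpha * n ** (-alpha / 2.0)


if __name__ == "__main__":
    n, sigma, alpha = 1000, 1.5, 0.8  # hypothetical values
    best = delta_n(n, sigma, alpha)
    for scale in (0.5, 0.9, 1.0, 1.1, 2.0):
        print(scale, objective(scale * best, n, sigma, alpha))
```

Multiplying the closed-form maximum by $n$ shows where the rate $n^{1-\alpha/2}$ in the lower bound comes from.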
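5.5. Proof of Theorem 4. 1°. We start with the proof of the lower bound
on the regret.

Let $\pi$ be an arbitrary policy from $\Pi$. Without loss of generality, we assume
that $x_0 = 1/2$ in the definition of the class $\mathcal{P}_1(\Theta)$. For every fixed $\theta$ let
$X_t$ be a random variable uniformly distributed on $D := [\theta - 1/2, \theta + 1/2]$,
that is, $f_{X,\theta}(x) = \mathbf{1}_D(x)$. Clearly, the corresponding joint distribution $P_{X,Y}$
of $(X, Y)$ belongs to the class $\mathcal{P}_1(\Theta)$ (see Definition 1). For any fixed $\theta$ and
$f_X = f_{X,\theta}$, we have

\[
\begin{aligned}
R_n(\pi, \pi^*) &= \mathbb{E}\sum_{t=1}^{n} |X_t - \theta|\, I\{\pi_t \ne \pi_t^*\} \\
&= \mathbb{E}\sum_{t=1}^{n} |X_t - \theta|\, I\{X_t \in [\gamma_t \wedge \theta,\ \gamma_t \vee \theta]\} \\
&= \mathbb{E}\sum_{t=1}^{n} \int_{\gamma_t \wedge \theta}^{\gamma_t \vee \theta} |x - \theta|\, \mathbf{1}_D(x)\,dx = \frac{1}{2}\,\mathbb{E}\sum_{t=1}^{n} |\bar\gamma_t - \theta|^2,
\end{aligned}
\]

where $\bar\gamma_t = \min\{\gamma_t, \theta + 1/2\}$ or $\bar\gamma_t = \max\{\gamma_t, \theta - 1/2\}$, depending on whether
$\gamma_t > \theta$ or $\gamma_t < \theta$. Thus, the problem is reduced to establishing a lower bound
on the maximal cumulative squared error for estimating the parameter $\theta \in \Theta$.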

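Let $\Lambda$ be a probability distribution on $\Theta$ with density $\lambda$ w.r.t. the Lebesgue
measure. We assume that $\lambda$ converges to zero at the endpoints of the interval
$\Theta$, and the Fisher information $I(\lambda)$ for the location parameter in $\lambda$ is positive
and finite. The minimax cumulative risk in estimating $\theta$ is lower bounded
by the Bayesian risk as follows:

\[
\inf_{\gamma}\sup_{\theta \in \Theta} \mathbb{E}\sum_{t=1}^{n} |\gamma_t - \theta|^2 \ge \inf_{\gamma} \int \mathbb{E}\sum_{t=1}^{n} |\gamma_t - \theta|^2\, \lambda(\theta)\,d\theta,
\tag{5.21}
\]

where inf is taken over all sequences $\gamma = (\gamma_t, t \ge 1)$ such that $\gamma_t$ is $\mathcal{F}_{t-1}$-measurable.
Let $\mathcal{F}^*_{t-1} = \sigma(X_1, \ldots, X_{t-1}, Y_1, \ldots, Y_{t-1})$; because $\mathcal{F}_{t-1} \subset \mathcal{F}^*_{t-1}$, the
expression on the RHS of (5.21) is lower bounded by $\inf_{\gamma} \int \mathbb{E}\sum_{t=1}^{n} |\gamma_t - \theta|^2\, \lambda(\theta)\,d\theta$,
where inf is taken over all sequences $\gamma = (\gamma_t)$ such that $\gamma_t$ is $\mathcal{F}^*_{t-1}$-measurable.
Thus, we have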



(5.22) 




where $\gamma_t$ is $\mathcal{F}^*_{t-1}$-measurable. Thus, the problem is reduced to establishing
a lower bound on the Bayesian risk in the problem of estimating the scalar
parameter $\theta \in \Theta$ from observations $\{(X_s, Y_s),\ s = 1, \ldots, t-1\}$, where $Y_s =
X_s - \theta + \varepsilon_s$, and $\varepsilon_s$ are i.i.d. zero mean Gaussian random variables with
variance $\sigma^2$. This problem is well studied, and there are different methods for
establishing such lower bounds [see, e.g., Borovkov and Sakhanenko (1980),
Brown and Gajek (1990) and Gill and Levit (1995)].

In particular, by the van Trees inequality [see Gill and Levit (1995)],

\[
\inf_{\gamma_t} \int \mathbb{E}|\gamma_t - \theta|^2\, \lambda(\theta)\,d\theta \ge \frac{1}{I_t + I(\lambda)},
\]

where $I_t$ is the expected Fisher information for $\theta$ associated with the conditional
density of observations $(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1})$ given $\theta$; and $I(\lambda)$ is
the Fisher information for the location parameter in $\lambda$. Thus,



2 



t-1 



2 



t-1 



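The standard choice of $\lambda$ is the following:

\[
\lambda(\theta) = \frac{1}{h}\lambda_0\Big(\frac{\theta - \tau_0}{h}\Big), \qquad \lambda_0(\theta) = \cos^2(\pi\theta/2)\, I\{|\theta| \le 1\},
\]

where $\tau_0$ is the center of the interval $\Theta := [\tau^-, \tau^+]$, and $h = \tau^+ - \tau^-$. With
this choice $I(\lambda_0) = \pi^2$ and $I(\lambda) = h^{-2}I(\lambda_0) = \pi^2 h^{-2}$. Therefore, applying
the van Trees inequality for each summand in (5.22), we obtain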

n 1 n -. 

i2„(vr;Pi(e)) > ^ = a^Y. , i^ 2 2.-2 ^ ^^'l^^ 

for n large enough. 
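
The value $I(\lambda_0) = \pi^2$ can be verified directly: for $\lambda_0(\theta) = \cos^2(\pi\theta/2)$ on $[-1, 1]$ one has $\lambda_0'(\theta) = -(\pi/2)\sin(\pi\theta)$, so $(\lambda_0')^2/\lambda_0 = \pi^2\sin^2(\pi\theta/2)$, which integrates to $\pi^2$. The Python sketch below confirms this with a midpoint rule and also checks that rescaling the density by a factor $h$ divides the Fisher information by $h^2$. The function names, the step count, and the rescaling form $\lambda(\theta) = h^{-1}\lambda_0((\theta - \tau_0)/h)$ are assumptions made for illustration.

```python
import math


def lambda0(theta):
    """Prior density cos^2(pi * theta / 2) on [-1, 1], zero elsewhere."""
    return math.cos(math.pi * theta / 2.0) ** 2 if abs(theta) <= 1.0 else 0.0


def lambda0_prime(theta):
    """Derivative of lambda0 on [-1, 1]: -(pi / 2) * sin(pi * theta)."""
    return -(math.pi / 2.0) * math.sin(math.pi * theta) if abs(theta) <= 1.0 else 0.0


def fisher_information(density, derivative, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of derivative^2 / density over [lo, hi].

    Midpoints lie strictly inside (lo, hi), so the density is never evaluated
    at the endpoints where it vanishes.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += derivative(x) ** 2 / density(x) * width
    return total


if __name__ == "__main__":
    print(fisher_information(lambda0, lambda0_prime, -1.0, 1.0), math.pi ** 2)
```

Because $I(\lambda)$ stays bounded while $I_t$ grows with $t$, the prior term becomes negligible in each summand for large $t$.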

2°. The lower bound on $S_n(\pi; \mathcal{P}_2(\Theta))$ follows from identical considerations.
In this case we choose $f_X$ to be linear in the vicinity of $\theta$; then for
any policy $\pi \in \Pi$, $S_n(\pi, \pi^*) \ge c\,\mathbb{E}\sum_{t=1}^{n} |\gamma_t - \theta|^2$ for any sequence $\gamma = (\gamma_t)$ of
random variables such that $\gamma_t$ is $\mathcal{F}_{t-1}$-measurable.